-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][test] Fix ManagedCursorTest and NonDurableCursorTest flaky tests #25101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][test] Fix ManagedCursorTest and NonDurableCursorTest flaky tests #25101
Conversation
…kDeleteBlockingWithMultiShots flaky test
…thMultiShots flaky test
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
Outdated
Show resolved
Hide resolved
2374c98 to
12c1124
Compare
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the comments about preferring AssertJ
|
@oneby-wang There seems to be a few more locations where the same flakiness pattern occurs. please check at least |
|
Got the two above comments. |
|
@lhotari Maybe we should do pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 3010 to 3012 in fbab357
|
@oneby-wang I guess that could make sense, however the problem is that there could be multiple cursors at the slowest position. However, it might be better to handle it separately. Just wondering if it's already handled by the pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 3109 to 3116 in fbab357
|
I read this code. If we mark delete 12:9, next ledger is an empty ledger: 13:-1, seems this method will only move mark delete position to 12:8. If ledger 12 is trimmed, then causing inconsistency. I'm wondering if mark delete position with -1 entryId is a valid mark delete position? Seems some codes intend to avoid this. pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 3234 to 3236 in fbab357
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 622 to 632 in fbab357
EDIT: if a cursor subscribe an empty managed ledger, then it's mark delete start with ledgerId:-1, it is a valid mark delete position. |
@oneby-wang Please address this so that we could get at least some improvements made to the flakiness. The |
|
@lhotari Working. |
good observations. It seems that there's a lot of code locations where it's hard to determine which is the correct behavior. |
|
I think maybe we can fix cursor position and ledger inconsistency by |
lhotari
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, however it seems that BadVersionException could occur in all cases where the ledger is opened with a different instance without closing the previous one. I think that retry would be needed to address those cases. We could handle that in a separate PR since this PR will already improve significantly the current CI flakiness.
Thanks for the great work, @oneby-wang
|
I’ll revisit the BadVersionExceptions over the next few days and submit a PR to address it. Thanks for reviewing and your great suggestions. @lhotari |
I think just in rollover and reopen race condition case. If ledger is not rolling over, which will not update metastore, so reopen ledger will success to update metastore without throwing BadVersionException. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #25101 +/- ##
=============================================
+ Coverage 39.14% 74.49% +35.35%
- Complexity 13441 33679 +20238
=============================================
Files 1842 1899 +57
Lines 145491 149655 +4164
Branches 16907 17393 +486
=============================================
+ Hits 56949 111482 +54533
+ Misses 80889 29311 -51578
- Partials 7653 8862 +1209
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Fixes #25092
Motivation
ManagedCursorTest.asyncMarkDeleteBlocking()test is very flaky after PR #25087, so fix it.Modifications
So I modify
if (!lastAckedPosition.equals(cursor.getMarkDeletedPosition()))toif (lastAckedPosition.compareTo(cursor.getMarkDeletedPosition()) > 0)inManagedLedgerImpl.Modify some typos, and add some todos(could be deleted if not needed) in
ManagedLedgerImpl.Add simplest test case method:
asyncMarkDeleteBlockingWithOneShot(), see method comments. I movec1.asyncMarkDeleteout ofasyncAddEntry#addCompletecallback method, because the two operations will compete for the same ledgerId due to race condition, which is the reason whyasyncMarkDeleteBlocking()test is more flaky after [fix][broker] Fix cursor position persistence in ledger trimming #25087. Some logs if I putc1.asyncMarkDeleteinasyncAddEntry#addCompletecallback method.Add
asyncMarkDeleteBlockingWithMultiShots()test to reproduce race condition betweenc1.asyncMarkDeleteandasyncAddEntry#addComplete, see method comments.Fix
asyncMarkDeleteBlocking()flaky test, see method comments.a. Fix
MetadataStoreException$BadVersionExceptiondue to race condition between ledger rollover and ledger recovery read.b. Fix assert failure due to race condition race condition between
c1.asyncMarkDeleteandasyncAddEntry#addCompleteAnd more tests, see PR discussion.
Verifying this change
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: oneby-wang#16